
feat: make embedding model configurable for non-English search#553

Open
oeniao wants to merge 1 commit into MemPalace:develop from oeniao:feat/configurable-embedding-model

Conversation


@oeniao oeniao commented Apr 10, 2026

Problem

The default ChromaDB embedding model (all-MiniLM-L6-v2) is English-only. Users writing in Chinese, Japanese, Korean, or other non-Latin languages get poor search results — match scores go negative and retrieved memories are irrelevant.

Solution

Add an embedding_model config option that lets users opt into any HuggingFace sentence-transformers model without patching installed files.

Configuration (priority order):

  1. Env var: MEMPALACE_EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2
  2. Config file: ~/.mempalace/config.json → {"embedding_model": "paraphrase-multilingual-MiniLM-L12-v2"}
  3. Default: null → uses ChromaDB built-in (unchanged behaviour)

Example for Chinese/Japanese/Korean users:

```json
{
  "embedding_model": "paraphrase-multilingual-MiniLM-L12-v2"
}
```

Changes

  • config.py: added DEFAULT_EMBEDDING_MODEL = None and an embedding_model property with env var support (MEMPALACE_EMBEDDING_MODEL)
  • palace.py: added a _get_embedding_function() helper that reads config and lazily imports sentence-transformers only when needed (both sketched below)
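
For concreteness, a rough sketch of the shape of these changes — hypothetical code, not a diff; beyond `DEFAULT_EMBEDDING_MODEL`, `embedding_model`, and `_get_embedding_function()`, the names and structure are illustrative:

```python
import json
import os

DEFAULT_EMBEDDING_MODEL = None
CONFIG_PATH = os.path.expanduser("~/.mempalace/config.json")

def get_embedding_model():
    # 1. Env var wins.
    env_val = os.environ.get("MEMPALACE_EMBEDDING_MODEL") or None
    if env_val:
        return env_val
    # 2. Fall back to the config file.
    try:
        with open(CONFIG_PATH) as f:
            return json.load(f).get("embedding_model") or DEFAULT_EMBEDDING_MODEL
    except (FileNotFoundError, json.JSONDecodeError):
        # 3. Default: None -> ChromaDB's built-in embedding function.
        return DEFAULT_EMBEDDING_MODEL

def _get_embedding_function():
    model = get_embedding_model()
    if model is None:
        return None  # caller passes nothing; ChromaDB uses its default
    try:
        # Lazy import: sentence-transformers is only needed on this path.
        from chromadb.utils.embedding_functions import (
            SentenceTransformerEmbeddingFunction,
        )
    except ImportError as exc:
        raise ImportError(
            "embedding_model is set but sentence-transformers is not "
            "installed; run `pip install sentence-transformers`"
        ) from exc
    return SentenceTransformerEmbeddingFunction(model_name=model)
```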

Backward compatibility

  • Default is null → behaviour identical to before
  • sentence-transformers is not a new required dependency — only needed when embedding_model is set
  • Existing palaces continue to work without any migration

Motivation

Discovered this while setting up MemPalace for daily use in Chinese. Had to patch the installed package directly — this PR makes it a proper first-class config option.

Add an `embedding_model` config option that lets users switch from the
default ChromaDB built-in (all-MiniLM-L6-v2, English-only) to any
HuggingFace sentence-transformers model.

This makes non-English search viable without patching installed files.
For example, setting `paraphrase-multilingual-MiniLM-L12-v2` improves
Chinese/Japanese/Korean recall significantly.

Configuration (priority order):
1. Env var:  MEMPALACE_EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2
2. Config:   ~/.mempalace/config.json → {"embedding_model": "..."}
3. Default:  null (uses ChromaDB built-in, no extra dependencies)

The `sentence-transformers` package is only required when a custom model
is configured; the default path is unchanged and fully backward-compatible.

@web3guru888 web3guru888 left a comment


This fills a real gap — the English-only ONNX default is a genuine pain point for non-Latin users, and patching installed files is not a reasonable workaround.

Design is sound. The layered config priority (env var → file → default) is consistent with how the rest of the codebase handles config. The lazy sentence-transformers import is exactly right: no new hard dependency, clean ImportError message telling the user exactly what to do.

One correctness concern worth flagging: _get_embedding_function() creates a new MempalaceConfig() instance on every call (and get_collection() calls it on every operation). That's minor overhead but not a real problem. The larger issue is that get_collection() passes the embedding function to both get_collection and create_collection — but ChromaDB ties the embedding function to the collection at creation time. If a palace was created with the default model and is later opened with MEMPALACE_EMBEDDING_MODEL set (or vice versa), ChromaDB will silently accept the mismatch and your embeddings become incoherent (new writes land in a different region of embedding space from old ones).

The guard against this would be a warning or error when the configured model doesn't match what the collection was created with — or at minimum a note in the docs that changing embedding_model on an existing palace requires re-mining.
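
A minimal sketch of such a guard, assuming the model name gets stamped into the collection's metadata at creation time — which this PR does not do yet, so `check_embedding_model` and the metadata key are hypothetical:

```python
import warnings

def check_embedding_model(collection, configured_model):
    # Compare against the model recorded when the collection was created.
    created_with = (collection.metadata or {}).get("embedding_model")
    if created_with is not None and created_with != configured_model:
        warnings.warn(
            f"Palace was created with embedding model {created_with!r} but "
            f"{configured_model!r} is now configured; old and new vectors "
            "will live in different embedding spaces until you re-mine."
        )
```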

Minor: no tests for the new config property or the _get_embedding_function() path. Even a couple of unit tests with a mock model name would catch regressions.

The core feature is right and the backward-compatibility story is solid (null default = unchanged behaviour). The model-mismatch case is the one thing worth addressing before merge — even just a doc note would be better than nothing.


[MemPalace-AGI integration — production stats at https://milla-jovovich.github.io/mempalace/integrations/mempalace-agi/]

@web3guru888

This is the right fix for #516 — and the design is clean.

The layered config priority (env var → config.json → default) is the correct approach. Env var support means CI/CD and containerized setups can override without touching config files, and the config.json path keeps it user-friendly for interactive use.

A few notes from running non-English content through our integration:

Backward-compat risk: If a user sets embedding_model on an existing palace, the collection's embedding space changes but the stored vectors don't. ChromaDB will silently accept the new function and produce wrong similarity scores — or fail with a dimension mismatch if the model has different output dims. Worth adding a note in the config docstring (and ideally a startup check) that this setting must be decided before first use. A warning like ⚠ Existing palace detected — ensure embedding_model matches the model used at creation would save people pain.

Empty string edge case: config.py:L164's `return env_val or None` correctly converts an empty string to None, though the `if env_val:` check above it already treats "" as falsy and falls through to the config file check, so the extra `or None` never fires. The logic is fine, just slightly redundant — `env_val = os.environ.get(...) or None` at the top would be clearer.
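
i.e. something like this (illustrative, not the actual config.py code):

```python
import os

def _env_embedding_model():
    # Normalizing at the top makes "" behave like unset.
    return os.environ.get("MEMPALACE_EMBEDDING_MODEL") or None
```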

Missing get_collection call sites: The _get_embedding_function() helper is only wired into get_collection(). But there are a few call sites in the codebase that construct their own client directly (search, exporter). Worth confirming those all go through get_collection so the embedding function is consistently applied.

The lazy import for sentence-transformers with the clear error message is exactly the right pattern — no silent dependency pollution for English-only users.

Overall this unblocks a real pain point for non-English users. LGTM with the backward-compat caveat addressed.

@mvalentsev
Contributor

#442 covers the same feature (configurable embedding model via env var / config.json) with a broader scope -- it patches all 7 ChromaDB consumer modules, adds mismatch detection, and includes a test suite. Both PRs touch config.py and palace.py for the same purpose so they'll conflict.

@web3guru888

Good callout @mvalentsev — yes, both PRs target the same config surface and will conflict.

The scopes are genuinely different though: this PR (#553) is minimal — configurable model for the primary search + mine path. #442 goes wider (all 7 ChromaDB consumers + mismatch detection + a test suite). Both are valid approaches.

Whether to merge #442 first, close this one, or merge this as a stopgap probably depends on #442's merge timeline and maintainer preference. If #442 is close to ready, that's likely the better merge path given the broader coverage. If #442 is stalled, this is a fast unblock for non-English users.

Either way, @oeniao it would be worth opening a quick coordination comment on #442 to flag the conflict — or offering to pull in mvalentsev's mismatch detection test idea into this PR if #442 ends up waiting.


[MemPalace-AGI integration — 215 tests, 710 KG entities]

@NickShtefan

Hey @oeniao — thanks for tackling this, non-English search is definitely a pain point.

Just a heads-up: #442 addresses the same problem with broader coverage — all 7 ChromaDB consumer modules are patched (not just palace.py), plus it includes embedding model mismatch detection, a re-mine command for model switching, configurable chunk size, device support (MPS/CUDA), and a test suite. It was just rebased on latest main today.

One note on the model choice: paraphrase-multilingual-MiniLM-L12-v2 has a 128-token context window which truncates longer chunks. I tested it and switched to intfloat/multilingual-e5-base (512 tokens) — much better for real-world content.
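
If you want to verify this yourself before picking a model, a quick REPL check (requires sentence-transformers; not part of the PR):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
print(model.max_seq_length)  # 128 -- anything longer gets truncated
```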

Since both PRs touch the same config surface and will conflict, it probably makes sense to coordinate. Happy to discuss!

@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
mvalentsev added a commit to mvalentsev/mempalace that referenced this pull request Apr 12, 2026
ORT_DISABLE_COREML is not a recognized ONNX Runtime environment
variable. ONNX Runtime does not expose a global env var to disable
individual execution providers -- providers are selected per session
via the providers argument to InferenceSession. Setting it had zero
effect, so the CoreMLExecutionProvider was still loaded on Apple
Silicon and the segfault from MemPalace#74 was not actually mitigated.

Replace the misleading setdefault call with a comment that records
the history and points at the real fix path. The proper CoreML
workaround requires passing preferred_providers at the ChromaDB
embedding function level, which is a larger change that belongs in
its own PR once the configurable embedding model work in MemPalace#442 /
MemPalace#553 lands.

Closes MemPalace#397
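
For reference, "per session via the providers argument" in ONNX Runtime looks like this — a minimal sketch with an illustrative model path, not the ChromaDB-level fix the commit describes:

```python
import onnxruntime as ort

# Providers are chosen per InferenceSession; listing only the CPU provider
# keeps CoreMLExecutionProvider from loading on Apple Silicon.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CPUExecutionProvider"],
)
```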
@igorls igorls added the area/search (Search and retrieval) and enhancement (New feature or request) labels Apr 14, 2026
